Entity matching across heterogeneous data sources: An approach based on constrained cascade generalization
نویسندگان
چکیده
To integrate or link the data stored in heterogeneous data sources, a critical problem is entity matching, i.e., matching records representing semantically corresponding entities in the real world, across the sources. While decision tree techniques have been used to learn entity matching rules, most decision tree learners have an inherent representational bias, that is, they generate univariate trees and restrict the decision boundaries to be axisorthogonal hyper-planes in the feature space. Cascading other classification methods with decision tree learners can alleviate this bias and potentially increase classification accuracy. In this paper, the authors apply a recently-developed constrained cascade generalization method in entity matching and report on empirical evaluation using real-world data. The evaluation results show that this method outperforms the base classification methods in terms of classification accuracy, especially in the dirtiest case. 2008 Elsevier B.V. All rights reserved.
منابع مشابه
Liquidity Demand of Heterogeneous Firms in the Economy of Iran: Microfoundation Approach
One of the main reasons in economic recession is the liquidity constraints of economic firms. When external finance becomes expensive for firms, money demand will act as a buffer stock. Actually, firms reduce unfavorable effect of liquidity constraints by holding money. In this study, after extracting firms liquidity demand function based on micro foundations and creating an index for firms liq...
متن کاملMatching Attributes across Overlapping Heterogeneous Data Sources Using Mutual Information
Identifying matching attributes across heterogeneous data sources is a critical and time-consuming step in integrating the data sources. In this paper, the author proposes a method for matching the most frequently encountered types of attributes across overlapping heterogeneous data sources. The author uses mutual information as a unified measure of dependence on various types of attributes. An...
متن کاملEntity Matching across Heterogeneous Sources
Given an entity in a source domain, finding its matched entities from another (target) domain is an important task in many applications. Traditionally, the problem was usually addressed by first extracting major keywords corresponding to the source entity and then query relevant entities from the target domain using those keywords. However, the method would inevitably fails if the two domains h...
متن کاملHeuristic joins to integrate structured heterogeneous data
Heterogeneous data sources often exhibit semantic heterogeneity at the data level; that is, the same entity in the world is referred to in different ways both within and across sources. This paper discusses a framework for combining information from such sources, called heuristic join, that is an extension of the familiar equi-join for homogeneous sources. Heuristic join uses heuristic match op...
متن کاملContext-aware Modeling for Spatio-temporal Data Transmitted from a Wireless Body Sensor Network
Context-aware systems must be interoperable and work across different platforms at any time and in any place. Context data collected from wireless body area networks (WBAN) may be heterogeneous and imperfect, which makes their design and implementation difficult. In this research, we introduce a model which takes the dynamic nature of a context-aware system into consideration. This model is con...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- Data Knowl. Eng.
دوره 66 شماره
صفحات -
تاریخ انتشار 2008